homogeneity assessment can be of value to any study that depends on the identification
of individuals with certain features based on some distinguishing genetic characteristics,
such as forensic applications.

Analyses assessing the similarity in the genetic profiles of individuals have been
pursued. For example, polymorphic microsatellites (primarily CA repeats) have been
used to construct trees of human individuals that reflect their geographic origin
(Bowcock et al., Nature 368:455-457, 1994), and to study the genetic variability within
and between cattle breeds (Ciampolini et al., J. Anim. Sci. 73:3259-3268, 1995). RFLP
genotypes have been used to construct trees of individuals of different ethnicities
(Mountain and Cavalli-Sforza, Am. J. Hum. Genet. 61:705-718, 1997). Random
amplified polymorphic DNA (RAPD) markers have been used to compute genetic
similarity coefficients (Lamboy, PCR Methods and Applications 4:31-37, 1994), and to
compare phenotype and genotype in plants (Jasienski et al., Heredity 78:176-181,
1997).

However, these analyses often rely on a priori knowledge of the groups to which
the individuals belong. Many do not permit the determination, in the absence of a priori
knowledge, of which populations may have contributed, and to what degree, to
the genetic variation within a pool or sample of individuals. Yet in the large
majority of cases, individuals sampled from a population represent an "admixture" of
genes from several populations. These populations are reflected in the genetic profiles
of individuals and hence can defy population segregation based on traditional markers
such as skin color and/or self-reported ethnic affiliation. Therefore, methods of analysis
are needed to accurately determine the existence of clusters of genetically similar
individuals, absent phenotypic (ethnic, for example) information. As noted previously,
knowledge of the homogeneity or heterogeneity of a population can be important under
many circumstances, including forensics and population-based studies.

In forensics, DNA fingerprinting requires the computation of 'match
probabilities' between the suspect and the DNA obtained on a victim. Match
probabilities are often computed relative to a database of non-suspect DNA. The utility
of the DNA contributed by non-suspects will be influenced by the amount of genetic
heterogeneity among the non-suspects (Jin & Chakraborty, Heredity 74:274-285, 1995;
Sawyer et al., Am. J. Hum. Genet. 59:272-274, 1996; Tomsey et al., J. Forensic Sci.
44:385-388, 1999). Thus, determining the heterogeneity of the non-suspect population

similarity for a new cluster; h) applying a non-hierarchical clustering algorithm to said
ordered set of similarity data using said optimal number of clusters; i) determining the
relatedness between pairs of homozygous pairs by performing a paired-pair analysis on
the clusters resulting from said non-hierarchical clustering algorithm, wherein the
homozygous loci of two pairs are compared pairwise to determine whether the pairs
share the same homozygous alleles on the same loci; j) summing said paired-pair
comparison for one pair versus all pairs in a cluster; k) computing the average sum of
said paired-pair comparison for all pairs in said cluster; l) assigning values to the
homozygous relatedness of each member of a pair to all homozygotes in said cluster
based on whether said sum for one pair is greater than or equal to said average sum of all
pairs, or whether said sum for one pair is less than said average sum for all pairs; m)
comparing the number of times said sum for one pair is greater than or equal to said
average sum of all pairs with the number of times said sum for one pair is less than said
average sum for all pairs for each individual in said cluster; and n) dividing said cluster
into a first cluster and a second cluster if there is: a first group of members of said
cluster wherein said number of times said sum of one pair is less than said average sum
of pairs is greater than or equal to said number of times said sum for one pair is greater
than or equal to said average sum for all pairs, and a second group of members of said
cluster wherein said number of times said sum of one pair is less than said average sum
of pairs is less than said number of times said sum for one pair is greater than or equal to
said average sum for all pairs, and wherein said first group of members are placed into
said first cluster and said second group of members are placed into said second cluster.
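
The division in steps m) and n) can be illustrated with a minimal Python sketch. It assumes that the "below average" and "at or above average" outcomes have already been tallied for each member of the cluster. The function name `split_cluster`, the dictionary layout, and the member labels are hypothetical and chosen only for illustration.

```python
def split_cluster(below_counts, above_counts):
    """Sketch of step n): group members by comparing their two tallies.

    below_counts[m]: times a pair containing member m scored below the average.
    above_counts[m]: times a pair containing member m scored at or above it.
    (Both dictionaries are assumed inputs, not defined by the source.)
    """
    first_group, second_group = [], []
    for member in below_counts:
        below = below_counts[member]
        above = above_counts.get(member, 0)
        if below >= above:
            first_group.append(member)   # "below" count >= "above" count
        else:
            second_group.append(member)  # "below" count < "above" count
    return first_group, second_group


# Hypothetical tallies for three members of one cluster.
below = {"ind1": 3, "ind2": 1, "ind3": 2}
above = {"ind1": 1, "ind2": 3, "ind3": 2}

first_cluster, second_cluster = split_cluster(below, above)
# The cluster is divided only if both groups are non-empty.
should_split = bool(first_cluster) and bool(second_cluster)
print(first_cluster, second_cluster, should_split)
```

With these example tallies, `ind1` and `ind3` fall into the first group and `ind2` into the second, so the cluster would be divided.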

In preferred embodiments of the invention, said traits are genetic loci and said
values are assigned to said traits based on the alleles of said genetic loci. Preferably,
said values are: 0 when a pair of members share no common allele; 1 when a pair of
members share a common allele; and 2 when a pair of members share two common
alleles. Preferably, said weights are assigned based on: sharing rare alleles between a
pair of members; and sharing a homozygous genotype between a pair of members.
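
The 0/1/2 allele-sharing values can be sketched in Python as follows. This is only an illustration under stated assumptions: genotypes are represented as two-allele tuples, shared alleles are counted as a multiset intersection, and the helper names `shared_allele_score` and `unweighted_similarity` are hypothetical. The weights for rare alleles and shared homozygous genotypes are deliberately left out, so the sum below is not the similarity measure of the invention.

```python
from collections import Counter


def shared_allele_score(genotype_a, genotype_b):
    """Return 0, 1, or 2: the number of alleles two diploid genotypes share.

    Assumption: alleles are counted with multiplicity, so ("A", "A")
    and ("A", "B") share one allele, not two.
    """
    common = Counter(genotype_a) & Counter(genotype_b)  # multiset intersection
    return sum(common.values())


def unweighted_similarity(profile_a, profile_b):
    """Sum the per-locus scores over all loci (no weights applied)."""
    return sum(shared_allele_score(a, b) for a, b in zip(profile_a, profile_b))


# Hypothetical profiles: one genotype per locus, three loci.
ind1 = [("A", "A"), ("B", "C"), ("D", "E")]
ind2 = [("A", "B"), ("B", "C"), ("F", "G")]

print(unweighted_similarity(ind1, ind2))  # 1 + 2 + 0 = 3
```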

In other preferred embodiments, said ordered set of similarity data is present in a
similarity matrix. Preferably, said similarity matrix is formed based on the pairwise
similarity measure:

pairwise to determine whether the pairs share the same homozygous or heterozygous
alleles on the same loci:

$$Z_{kl} = \sum_{i=1}^{L} a_i$$

where $a_i = 1$ when two sets of pairs have the same homozygous or heterozygous alleles on
the same loci, otherwise $a_i = 0$;
where $L$ denotes the total number of loci;
where $k$ and $l$ each represent different pairs of individuals in a particular cluster; and
where $k = 1, \ldots, N$ and $l = 1, \ldots, N$, where $N$ is the number of individuals in a particular
cluster.
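
One possible reading of the paired-pair comparison is sketched below in Python. The interpretation is an assumption: a pair of individuals is taken to "have" a genotype at a locus only when both members carry the same genotype there, and the indicator is 1 at a locus when two such pairs carry the same genotype. The helper names `pair_genotype` and `paired_pair_score` are hypothetical.

```python
def pair_genotype(ind_a, ind_b):
    """Per-locus genotype common to both members of a pair, or None.

    Assumption: the order of alleles within a genotype does not matter.
    """
    shared = []
    for g_a, g_b in zip(ind_a, ind_b):
        same = sorted(g_a) == sorted(g_b)
        shared.append(tuple(sorted(g_a)) if same else None)
    return shared


def paired_pair_score(pair_k, pair_l):
    """Count loci at which both pairs carry the same shared genotype."""
    return sum(
        1
        for g_k, g_l in zip(pair_k, pair_l)
        if g_k is not None and g_k == g_l
    )


# Hypothetical individuals typed at two loci.
ind1 = [("A", "A"), ("B", "C")]
ind2 = [("A", "A"), ("C", "B")]
ind3 = [("A", "A"), ("B", "B")]
ind4 = [("A", "A"), ("B", "B")]

pair_12 = pair_genotype(ind1, ind2)  # [("A", "A"), ("B", "C")]
pair_34 = pair_genotype(ind3, ind4)  # [("A", "A"), ("B", "B")]
print(paired_pair_score(pair_12, pair_34))  # only the first locus matches: 1
```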

Subsequently, the average score (sum of $Z_{kl}$) of said paired-pair comparison for
all pairs in a cluster is computed:

$$\bar{Z} = \frac{\sum_{k>l=1}^{N} W_{kl}}{M}$$

where $M$ is the number of permutations of paired-pairs, and $W_{kl}$ is the sum of the
paired-pair comparison of one pair versus all pairs in a cluster.
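
The summation and averaging can be sketched in Python as follows. The sketch assumes the paired-pair scores are already available as a symmetric table with one row per pair, and it takes M as a caller-supplied value, because how the "number of permutations of paired-pairs" should be counted is not resolved here. The names `pair_sums` and `average_score` and the example numbers are hypothetical.

```python
def pair_sums(z):
    """For each pair (row), sum its paired-pair scores against all other pairs."""
    n = len(z)
    return [sum(z[k][l] for l in range(n) if l != k) for k in range(n)]


def average_score(w, m):
    """Average of the per-pair sums, divided by the supplied count m."""
    return sum(w) / m


# Hypothetical symmetric table of paired-pair scores for three pairs.
z = [
    [0, 2, 1],
    [2, 0, 3],
    [1, 3, 0],
]

w = pair_sums(z)                    # [3, 5, 4]
z_bar = average_score(w, m=len(w))  # m = number of pairs is an assumption
print(w, z_bar)                     # [3, 5, 4] 4.0
```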

Subsequently, the sum of the comparison of one pair versus all pairs in a cluster
is compared with the average sum for all pairs in order to assign a value to the
homozygous or heterozygous relatedness of each member of a pair to all homozygotes or
heterozygotes in a cluster:

$$\text{if } W_{kl} \geq \bar{Z} \text{ then } a_{i,lo} = 0,\; a_{i,ob} = 1$$

$$\text{if } W_{kl} < \bar{Z} \text{ then } a_{i,lo} = 1,\; a_{i,ob} = 0$$

for each $k > l = 1, \ldots, N$; and $i = k, l$,

where $a_{i,lo}$ indicates that the individual's score for $W$ is "below" the average score for the
cluster, and
where $a_{i,ob}$ indicates that the individual's score for $W$ is "above" the average score for the
cluster.
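
The assignment of the "below" and "above" indicators, and their tally per individual, can be sketched in Python. The sketch assumes the per-pair sums are stored in a dictionary keyed by the pair of individuals (k, l) with k > l, and that the cluster average has already been computed. The function name `tally_flags` and the example values are hypothetical.

```python
def tally_flags(w, z_bar):
    """Count, per individual, how often its pairs fall below or at/above average.

    w: dict mapping (k, l) pairs of individual labels to their summed score.
    z_bar: average score for the cluster (assumed to be precomputed).
    """
    below, above = {}, {}
    for (k, l), w_kl in w.items():
        is_above = w_kl >= z_bar
        for i in (k, l):  # both members of the pair receive the same indicator
            above[i] = above.get(i, 0) + (1 if is_above else 0)
            below[i] = below.get(i, 0) + (0 if is_above else 1)
    return below, above


# Hypothetical summed scores for the three pairs among individuals 1, 2, 3.
w = {(2, 1): 5, (3, 1): 2, (3, 2): 3}
below, above = tally_flags(w, z_bar=3.5)
print(below)  # {2: 1, 1: 1, 3: 2}
print(above)  # {2: 1, 1: 1, 3: 0}
```

In this example, individual 3 is the only member whose "below" count exceeds its "above" count.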